AV signal processing apparatus for detecting a boundary between scenes, method, recording medium and computer program therefor

ABSTRACT

The invention provides an AV signal processing apparatus and method by which a boundary between scenes is detected so that recorded video data can be played back beginning with an arbitrary scene. First, video data inputted is divided into video segments or audio segments or, if possible, into both of video and audio segments. Then, feature amounts representative of features of the segment are calculated, and then, similarity measurement between segments is performed using the feature amounts. Thereafter, it is discriminated whether or not the segment corresponds to a break of a scene. Thus, the video-audio processing apparatus uses the dissimilarity measurement criterion and the feature amounts calculated as above to determine, regarding each segment as the reference segment at present, in which one of the past and the future with respect to the reference segment the ratio of presence of neighboring similar segments is higher, and investigates the pattern of the change of the ratio to discriminate whether or not the reference segment corresponds to a boundary of a scene.

BACKGROUND OF THE INVENTION

This invention relates to an AV signal processing apparatus and method as well as a recording medium, and more particularly to an AV signal processing apparatus and method as well as a recording medium suitable for use in selecting and playing back a desired portion of a video signal.

It is sometimes desired to search for and play back a desired portion such as an interesting portion from within a video application composed of a large amount of different video data such as, for example, television broadcasts recorded as video data.

One conventional technique for extracting desired video contents in this manner is a storyboard, which is a panel formed from a series of videos which represent major scenes of an application. The storyboard displays videos representing individual shots into which video data are divided. Almost all of such video extraction techniques automatically detect and extract shots from within video data as disclosed, for example, in G. Ahanger and T. D. C. Little, “A survey of technologies for parsing and indexing digital video”, J. of Visual Communication and Image Representation 7, 28-4, 1996.

However, a typical 30-minute television broadcast, for example, includes hundreds of shots. Therefore, in the conventional video extraction technique described above, a user must check a storyboard on which a very great number of extracted shots are juxtaposed, and a very heavy burden is imposed on the user who tries to understand the storyboard.

The conventional video extraction technique is further disadvantageous in that, for example, shots of a scene of conversation obtained by imaging two persons alternately depending upon which one of the persons talks include many redundant shots. In this manner, shots are very low in hierarchy as an object of extraction of a video structure and include a great amount of wasteful information, and the conventional video extraction technique by which such shots are extracted is not convenient to its user.

Another video extraction technique uses very professional knowledge regarding a particular contents genre such as news or a football game as disclosed, for example, in A. Merlino, D. Morey and M. Maybury, “Broadcast news navigation using story segmentation”, Proc. of ACM Multimedia 97, 1997 or Japanese Patent Laid-Open No. 136297/1998. However, although the conventional video extraction technique can provide a good result in regard to an object genre, it is disadvantageous in that it is not useful to the other genres at all and besides it cannot be generalized readily because its application is limited to a particular genre.

A further video extraction technique extracts story units as disclosed, for example, in U.S. Pat. No. 5,708,767. However, the conventional video extraction technique is not fully automated and requires an operation of a user in order to determine which shots indicate the same contents. The conventional video extraction technique is disadvantageous also in that complicated calculation is required for processing and the object of its application is limited only to video information.

A still further video extraction technique combines detection of shots with detection of a no sound period to discriminate a scene as disclosed, for example, in Japanese Patent Laid-Open No. 214879/1997. The video extraction technique, however, can be applied only where a no sound period corresponds to a boundary between shots.

A yet further video extraction technique detects repeated similar shots in order to reduce the redundancy in display of a storyboard as disclosed, for example, in H. Aoki, S. Shimotsuji and O. Hori, “A shot classification method to select effective key-frames for video browsing”, IPSJ Human Interface SIG Notes, 7: 43-50, 1996. The conventional video extraction technique, however, can be applied only to video information but cannot be applied to audio information.

The conventional video extraction techniques described above further have several problems in incorporating them into apparatus for domestic use such as a set top box or a digital video recorder. This arises from the fact that the conventional video extraction techniques are configured supposing that post-processing is performed. More specifically, they have the following three problems.

The first problem resides in that the number of segments depends upon the length of contents, and even if the number of segments is fixed, the number of shots included in them is not fixed. Therefore, the memory capacity necessary for scene detection cannot be fixed, and consequently, the required memory capacity must be set to an excessively high level. This is a significant problem with apparatus for domestic use which have a limited memory capacity.

The second problem resides in that apparatus for domestic use require real-time processing to complete a determined process within a determined time without fail. However, since the number of segments cannot be fixed and post-processing must be performed, it is difficult to always complete a process within a predetermined time. This signifies that, where a CPU (central processing unit) which does not have a high performance and is used in apparatus for domestic use must be used, it is further difficult to perform real time processing.

The third problem resides in that, since post processing is required as described above, processing of scene detection cannot be completed each time a segment is produced. This signifies that, if a recording state is inadvertently stopped by some reason, an intermediate result till then cannot be obtained. This signifies that sequential processing during recording is impossible and is a significant problem with apparatus for domestic use.

Further, with the conventional video extraction apparatus described above, when a scene is to be determined, a method which is based on a pattern of repetitions of segments or grouping of segments is used, and therefore, a result of scene detection is unique. Therefore, it is impossible to discriminate whether or not a boundary detected is an actual boundary between scenes with high possibility, and the number of detected scenes cannot be controlled stepwise.

Further, in order that videos can be seen easily, it is necessary to minimize the number of scenes. Therefore, a problem occurs that, where the number of detected scenes is limited, it must be discriminated what scenes should be displayed. Therefore, if the significance of each scene obtained is determined, then the scenes may be displayed in accordance with the order of significance thereof. However, the conventional video extraction techniques do not provide a scale to be used for measurement of the degree of significance for each scene obtained.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide an AV signal processing apparatus and method as well as a recording medium by which a boundary between scenes is detected so that recorded video data can be played back beginning with an arbitrary scene.

In order to attain the object described above, according to an aspect of the present invention, there is provided an AV signal processing apparatus for detecting and analyzing a pattern which reflects a significance structure of contents of an AV signal supplied thereto to detect a scene of a significant break, including feature amount extraction means for extracting feature amounts of segments each formed from a series of frames which form the AV signal, calculation means for calculating a measurement criterion to be used for measurement of a similarity of the feature amounts between a reference segment and other segments, similarity measurement means for using the measurement criterion to measure the similarity between the reference segment and the other segments, measurement value calculation means for using the similarity measured by the similarity measurement means to calculate a measurement value indicative of a possibility that the reference segment may be a boundary of the scene, and boundary discrimination means for analyzing a variation of a pattern with respect to time of the measurement value calculated by the measurement value calculation means and discriminating based on a result of the analysis whether or not the reference segment is the boundary of the scene.

The AV signal may include at least one of a video signal and an audio signal.

The AV signal processing apparatus may further include intensity value calculation means for calculating an intensity value indicative of a degree of the variation of the measurement value corresponding to the reference segment.

The measurement value calculation means may calculate similar segments in a predetermined time area with respect to the reference segment, analyze the time distribution of the similar segments and determine a ratio at which the similar segments are present in the past and in the future to calculate the measurement value.

The boundary discrimination means may discriminate based on a sum total of the absolute values of the measurement values whether or not the reference segment is the boundary of the scene.

The AV signal processing apparatus may further include video segment production means for detecting, when the AV signal includes a video signal, a shot which is a basic unit of a video segment to produce the video segment.

The AV signal processing apparatus may further include audio segment production means for using, when the AV signal includes an audio signal, at least one of the feature amount of the audio signal and a no sound period to produce an audio segment.

The feature amounts of the video signal may at least include a color histogram.

The feature amounts of the audio signal may include at least one of a sound volume and a spectrum.

The boundary discrimination means may compare the measurement value with a preset threshold value to discriminate whether or not the reference segment is a boundary of the scene.

According to another aspect of the present invention, there is provided an AV signal processing method for an AV signal processing apparatus for detecting and analyzing a pattern which reflects a significance structure of contents of an AV signal supplied thereto to detect a scene of a significant break, comprising a feature amount extraction step of extracting feature amounts of segments each formed from a series of frames which form the AV signal, a calculation step of calculating a measurement criterion to be used for measurement of a similarity of the feature amounts between a reference segment and other segments, a similarity measurement step of using the measurement criterion to measure the similarity between the reference segment and the other segments, a measurement value calculation step of using the similarity measured by the processing in the similarity measurement step to calculate a measurement value indicative of a possibility that the reference segment may be a boundary of the scene, and a boundary discrimination step of analyzing a variation of a pattern with respect to time of the measurement value calculated by the processing in the measurement value calculation step and discriminating based on a result of the analysis whether or not the reference segment is the boundary of the scene.

According to a further aspect of the present invention, there is provided a recording medium on which a computer-readable program for AV signal processing for detecting and analyzing a pattern which reflects a significance structure of contents of a supplied AV signal to detect a scene of a significant break is recorded, the program including a feature amount extraction step of extracting feature amounts of segments each formed from a series of frames which form the AV signal, a calculation step of calculating a measurement criterion to be used for measurement of a similarity of the feature amounts between a reference segment and other segments, a similarity measurement step of using the measurement criterion to measure the similarity between the reference segment and the other segments, a measurement value calculation step of using the similarity measured by the processing in the similarity measurement step to calculate a measurement value indicative of a possibility that the reference segment may be a boundary of the scene, and a boundary discrimination step of analyzing a variation of a pattern with respect to time of the measurement value calculated by the processing in the measurement value calculation step and discriminating based on a result of the analysis whether or not the reference segment is the boundary of the scene.

With the AV signal processing apparatus and method and the program of the recording medium, feature amounts of segments each formed from a series of frames which form the AV signal are extracted, and a measurement criterion to be used for measurement of a similarity of the feature amounts between a reference segment and other segments is calculated. Then, the measurement criterion is used to measure the similarity between the reference segment and the other segments, and the measured similarity is used to calculate a measurement value indicative of a possibility that the reference segment may be a boundary of the scene. Thereafter, a variation of a pattern with respect to time of the measurement value calculated is analyzed, and it is discriminated based on a result of the analysis whether or not the reference segment is the boundary of the scene. Therefore, a boundary of a scene can be detected, and consequently, recorded video data can be played back beginning with an arbitrary scene.

The above and other objects, features and advantages of the present invention will become apparent from the following description and the appended claims, taken in conjunction with the accompanying drawings in which like parts or elements are denoted by like reference symbols.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic view showing a hierarchical model of video data;

FIG. 2 is a schematic view showing a boundary area and a non-boundary area of a scene;

FIG. 3 is a block diagram showing a typical configuration of a video-audio processing apparatus to which the present invention is applied;

FIGS. 4A and 4B are schematic views showing a boundary area between scenes;

FIG. 5 is a flow chart illustrating operation of the video-audio processing apparatus shown in FIG. 3;

FIGS. 6A to 6E are schematic views showing a typical distribution pattern of similar segments;

FIG. 7 is a diagram illustrating a result of scene detection; and

FIG. 8 is a flow chart illustrating processing of a scene detection section of the video-audio processing apparatus shown in FIG. 3.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

According to the present invention, video data are cut in a unit of a scene which is a set of significant segments. The term “cut” here signifies detection of a boundary between scenes. Segments which compose a scene have features unique to the scene, and therefore, if a boundary between adjacent scenes is passed, then the segments which compose the scene exhibit significantly different features from those of the segments of the other scene. In other words, a place at which such a notable difference appears is a boundary between scenes, and a series of segments can be cut in a unit of a scene by detecting such a boundary.

Before the processing just described is performed, object video data are first divided in a unit of a segment similarly as in the conventional video extraction techniques described hereinabove. The segments obtained by such division form a time series, and it is necessary to discriminate whether or not a scene boundary is present between each segment and another segment next to the segment. Here, each segment is determined as a reference, and it is investigated at what place in time a similar segment is present among neighboring segments.

To detect a scene boundary, a changing point is detected at which a peculiar change appears in a short time from a pattern wherein similar segments are present in a concentrated manner in the past to another pattern wherein similar segments are present in a concentrated manner in the future. In order to find a place at which such a pattern change occurs, sufficient information is obtained merely by investigating a local change around a boundary of a scene.

Further, it is also possible to measure the magnitude of the local change to control cutting of a scene stepwise. This is because it has been empirically found out that a visual changing point coincides well with a significant changing point of a scene. The present invention makes use of the foregoing to detect a boundary of a scene and cut scenes of video data or the like. Further, the present invention makes it possible for a user to see video data easily based on such scene boundary information.

Now, an outline of the present invention is described more specifically. First, features of video data where a boundary between scenes is present and where a boundary is not present between scenes are described individually. An example of particular video data is illustrated in FIG. 2. Referring to FIG. 2, the video data are illustrated in a unit of a segment and include three scenes 1 to 3. The time axis is directed in the rightward direction in FIG. 2. An area in which no boundary is present is denoted as a non-boundary area while an area in which a boundary is present is denoted as a boundary area, and the two areas are shown in more detail in FIGS. 4A and 4B, respectively.

The video data within the time of the scene 2 is shown in the non-boundary area of FIG. 4A and includes the segments 3 to 11 which do not include a boundary from another scene. In contrast, the boundary area of FIG. 4B is a time area of the segments 8 to 15 which includes a boundary area between the scene 2 and the scene 3 and in which the two scenes are contiguous to each other.

First, features of the non-boundary area which does not include a boundary are described. Since the non-boundary area is composed only of similar segments, where the segments are divided into those in the past and those in the future with respect to a reference segment in the non-boundary area, similar segments are present substantially uniformly in the two time zones. Therefore, the distribution pattern of similar segments does not exhibit a peculiar variation.

Different from the non-boundary area, the boundary area represents a time zone which includes a boundary point at which two scenes are contiguous to each other. The scene here signifies a scene composed of segments having a high similarity to each other. Therefore, the segments 8 to 11 which compose the scene 2 and the segments 12 to 15 which compose the different scene 3 are contiguous to each other, and the features of the segments of the scenes are different across the boundary between the scenes.

In order to detect a boundary of a scene, it is first assumed that each segment is a time reference (present). Then, the detection of a boundary of a scene can be realized by investigating the variation of the distribution pattern with respect to time of most similar segments to each of the segments (whether such similar segments belong to the past or the future with respect to the reference).

More specifically, as can be seen from the boundary area shown in FIG. 4B, as the segments 8 to 11 are successively used as the time reference and the time reference approaches the boundary, the ratio of those most similar segments which belong to the past to those which belong to the future gradually increases, and immediately prior to the boundary (at the end of the scene), the ratio becomes 100%. Then, immediately after the reference segment passes the boundary (at the top of the next scene), conversely the ratio of those most similar segments which belong to the future to those which belong to the past exhibits 100%. Then, as the segments 12 to 15 are successively used as the time reference, the ratio described above decreases.
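The behavior of the ratio around a boundary can be illustrated with a minimal sketch. The function name `past_ratio`, the scalar feature values, the absolute-difference dissimilarity and the parameters `k` and `window` are assumptions made only for this illustration and are not taken from the embodiment.

```python
def past_ratio(features, ref, k=3, window=3):
    """Fraction of the k most similar neighbours of segment `ref` lying in the past.

    `features` holds one hypothetical scalar feature per segment, in time order.
    Only segments within `window` positions of `ref` are examined.
    """
    lo = max(0, ref - window)
    hi = min(len(features), ref + window + 1)
    neighbours = [i for i in range(lo, hi) if i != ref]
    # Smaller absolute difference means higher similarity (assumed criterion).
    neighbours.sort(key=lambda i: abs(features[i] - features[ref]))
    nearest = neighbours[:k]
    if not nearest:
        return 0.5  # no neighbours at all: treat as balanced
    return sum(1 for i in nearest if i < ref) / len(nearest)


# Toy data: four segments of one scene followed by four of another scene.
features = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0]
ratios = [past_ratio(features, i) for i in range(len(features))]
```

In this toy sequence the last segment of the first scene has all of its most similar neighbours in the past, and the first segment of the second scene has all of them in the future, which corresponds to the 100% figures described above.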

Accordingly, a place which is a boundary of a scene with the highest possibility can be specified from a variation of the pattern of the distribution ratio with respect to time of such most similar segments. Further, since the typical pattern appears with a very high possibility at a local portion in the proximity of a boundary of a scene, the boundary can be specified from the variation of the pattern merely by checking segments around the boundary. In other words, the time area within which the distribution pattern of similar segments is examined need not be set to a greater area than a particular area.

Further, if the variation of the pattern is represented by a numerical value, then the degree of the variation of the value varies together with the degree of a visual variation of the scene. It is known empirically and based on a result of an experiment that the degree of the visual variation of the scene changes together with the degree of a significant variation of the scene. Accordingly, if the numerical value mentioned above is determined as a boundary likelihood measurement value, then a scene corresponding to the magnitude of the significant degree of a scene can be detected based on the magnitude of the boundary likelihood measurement value.

Now, video data which is an object of processing of a video-audio processing apparatus to which the present invention is applied is described.

In the present invention, it is assumed that video data of an object of processing has such a modeled data structure as shown in FIG. 1 wherein it has three hierarchical layers of frame, segment and scene. In particular, the video data is composed of a series of frames in the lowermost hierarchical layer. Further, the video data is composed of segments, each of which is formed from a series of successive frames, in a higher hierarchical layer. Furthermore, the video data is composed of scenes, each of which is formed from segments collected based on a significant relation, in the highest hierarchical layer.
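The three-layer model can be pictured with a small data-structure sketch. The class and field names (`Segment`, `Scene`, `start_frame`, `end_frame`) are hypothetical and chosen only for illustration; frames are represented simply by their index.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Segment:
    """A run of successive frames, identified by frame indices (end inclusive)."""
    start_frame: int
    end_frame: int

    def frame_count(self) -> int:
        return self.end_frame - self.start_frame + 1


@dataclass
class Scene:
    """A group of segments collected on the basis of a significant relation."""
    segments: List[Segment] = field(default_factory=list)

    def frame_range(self) -> Tuple[int, int]:
        # Assumes the segments are stored in time order and the scene is non-empty.
        return (self.segments[0].start_frame, self.segments[-1].end_frame)


# Two contiguous segments forming one scene.
scene = Scene([Segment(0, 89), Segment(90, 149)])
```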

The video data usually includes both video and audio information. In particular, a frame of the video data includes a video frame which is a single still picture and an audio frame representative of audio information usually sampled over a short time such as several tens to several hundreds of milliseconds in length.

Meanwhile, a video segment is formed from a series of video frames picked up successively by means of a single camera and is usually called a shot.

On the other hand, an audio segment can be defined in various manners. As one of such definitions, an audio segment is formed with a boundary defined by a no sound period in video data detected by a method well known in the art. An audio segment is sometimes formed from a series of audio frames which are classified into a small number of categories such as, for example, voice, music, noise, no sound and so forth as disclosed in D. Kimber and L. Wilcox, “Acoustic Segmentation for Audio Browsers”, Xerox Parc Technical Report. Further, an audio segment is sometimes determined based on a turning point of sound detected as a great change in a certain feature between two successive audio frames as disclosed in S. Pfeiffer, S. Fischer and E. Wolfgang, “Automatic Audio Content Analysis”, Proceeding of ACM Multimedia 96, November 1996, pp21-30.
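The first of these definitions, in which a no sound period delimits audio segments, might be sketched as follows. The function `split_on_silence`, the per-frame level values, the level threshold and the minimum silence length are all assumptions introduced for this illustration; the description leaves the actual no sound detection to methods well known in the art.

```python
def split_on_silence(levels, threshold=0.05, min_silence=3):
    """Return (start, end) audio-frame index pairs, end exclusive.

    `levels` is a hypothetical list of per-audio-frame sound levels. A run of
    at least `min_silence` frames below `threshold` is treated as a no sound
    period that closes the current segment.
    """
    segments = []
    start = None   # index where the current segment began, if any
    quiet = 0      # length of the current run of quiet frames
    for i, level in enumerate(levels):
        if level < threshold:
            quiet += 1
            if start is not None and quiet >= min_silence:
                segments.append((start, i - quiet + 1))
                start = None
        else:
            if start is None:
                start = i
            quiet = 0
    if start is not None:
        # Close the final segment, dropping any trailing quiet frames.
        segments.append((start, len(levels) - quiet))
    return segments


levels = [0.5, 0.6, 0.0, 0.0, 0.0, 0.4, 0.3, 0.0]
audio_segments = split_on_silence(levels)
```

A quiet run shorter than `min_silence` does not split a segment, so brief pauses stay inside one audio segment.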

A scene is based on significance of contents of video data and belongs to a higher level. A scene is subjective and relies upon contents or a genre of video data. A scene is composed of video segments or audio segments whose features are similar to each other.

Here, a changing point is detected at which a peculiar change is exhibited from a pattern wherein segments present in the proximity of each segment in video data and having similar features to those of the segment are present in a concentrated manner in the past to another pattern wherein segments in the proximity of each segment in video data and having similar features are present in a concentrated manner in the future, and those segments from the changing point to a next point are determined as one scene. The reason why such patterns correspond to a break between scenes is that similar features of the segments exhibit a significant variation at the boundary between the scenes because the features of the segments included in the scenes are different from each other. This is much related to a significance structure at a high level of video data, and a scene indicates such a significant mass of video data at a high level.

Now, a typical configuration of a video-audio processing apparatus to which the present invention is applied is described with reference to FIG. 3. The video-audio processing apparatus measures a similarity between segments of video data using feature amounts of the segments and collects similar segments into scenes to automatically extract a video structure. Thus, the video-audio processing apparatus can be applied to both of video segments and audio segments.

The video-audio processing apparatus includes a video division section 11 for dividing a stream of video data inputted thereto into video segments, audio segments or video and audio segments, a video segment memory 12 for storing division information of the video data, a video feature amount extraction section 13 for extracting feature amounts of the video segments, an audio feature amount extraction section 14 for extracting feature amounts of the audio segments, a segment feature amount memory 15 for storing the feature amounts of the video segments and the audio segments, a scene detection section 16 for collecting the video segments and the audio segments into scenes, and a feature amount similarity measurement section 17 for measuring a similarity between two segments.

The video division section 11 divides a stream of video data inputted thereto and including video data and audio data of various digital formats including a compression video data format such as, for example, the MPEG (Moving Picture Experts Group) 1, the MPEG 2 or the DV (Digital Video) into video segments, audio segments or video and audio segments.

Where the inputted video data are of a compression format, the video division section 11 can process the compressed video data directly without decompressing them fully. The video division section 11 processes the inputted video data to classify them into video segments and audio segments. Further, the video division section 11 outputs division information which is a result of division of the inputted video data to the video segment memory 12 in the next stage. Furthermore, the video division section 11 outputs the division information to the video feature amount extraction section 13 and the audio feature amount extraction section 14 in accordance with the video segments and the audio segments.

The video segment memory 12 stores the division information of the video data supplied thereto from the video division section 11. Further, the video segment memory 12 outputs the division information to the scene detection section 16 in response to an inquiry from the scene detection section 16 which is hereinafter described.

The video feature amount extraction section 13 extracts feature amounts of each of the video segments obtained by the division of the video data by the video division section 11. The video feature amount extraction section 13 can process compressed video data directly without decompressing them fully. The video feature amount extraction section 13 outputs the extracted feature amounts of each video segment to the segment feature amount memory 15 in the next stage.

The audio feature amount extraction section 14 extracts feature amounts of each of the audio segments obtained by the division of the video data by the video division section 11. The audio feature amount extraction section 14 can process compressed audio data directly without decompressing them fully. The audio feature amount extraction section 14 outputs the extracted feature amounts of each audio segment to the segment feature amount memory 15 in the next stage.

The segment feature amount memory 15 stores the feature amounts of each video segment and each audio segment supplied thereto from the video feature amount extraction section 13 and the audio feature amount extraction section 14, respectively. The segment feature amount memory 15 outputs the feature amounts of the segments stored therein to the feature amount similarity measurement section 17 in response to an inquiry from the feature amount similarity measurement section 17 which is hereinafter described.

The scene detection section 16 uses the division information stored in the video segment memory 12 and similarities between segments to discriminate whether or not a video segment and an audio segment make a boundary of a scene. The scene detection section 16 specifies a changing point across which the distribution pattern of those neighboring segments which are in the neighborhood of and have very similar feature amounts to those of each segment changes from that wherein such segments are concentrated in the past to that wherein such segments are concentrated in the future to detect boundaries of a scene to determine a top portion and a last portion of the scene. The scene detection section 16 shifts the reference segment by one segment in a time series each time a segment is detected and measures the distribution pattern of those segments which are in the proximity of and most similar to the reference segment. The scene detection section 16 uses the feature amount similarity measurement section 17 to specify the number of those neighboring segments which are most similar to the reference segment. In other words, the scene detection section 16 determines the number of the most neighboring feature amounts in the feature space. Then, the scene detection section 16 specifies a boundary of a scene from a change of the pattern of the difference between the number of the most similar neighboring segments in the past and the number of those in the future across a segment.
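One way to picture the final step is the following sketch, which takes, for each reference segment, the difference between the number of most similar neighbours in the past and the number in the future, and reports the places where the difference swings from positive to negative. The function `find_boundaries`, the sign convention, the use of the summed absolute values as an intensity and the `min_intensity` parameter are assumptions for illustration, not the procedure of the embodiment.

```python
def find_boundaries(diffs, min_intensity=0):
    """Locate candidate scene boundaries from past-minus-future counts.

    `diffs[i]` is a hypothetical value: (number of most similar neighbours of
    segment i in the past) minus (number in the future). A swing from a
    positive to a negative value between segments i and i + 1 is reported as a
    boundary in front of segment i + 1, together with an assumed intensity.
    """
    boundaries = []
    for i in range(len(diffs) - 1):
        if diffs[i] > 0 and diffs[i + 1] < 0:
            intensity = abs(diffs[i]) + abs(diffs[i + 1])
            if intensity >= min_intensity:
                boundaries.append((i + 1, intensity))
    return boundaries


# Toy sequence: a strong swing before segment 5 and a weak one before segment 8.
diffs = [-4, -2, 0, 2, 4, -4, -2, 1, -1, 2]
all_boundaries = find_boundaries(diffs)
strong_boundaries = find_boundaries(diffs, min_intensity=4)
```

Raising `min_intensity` keeps only the more pronounced swings, which is one way the number of detected scenes could be controlled stepwise.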

The feature amount similarity measurement section 17 measures the similarity between each segment and neighboring segments. The feature amount similarity measurement section 17 issues an inquiry to the segment feature amount memory 15 to search for feature amounts regarding a certain segment.

A video data recording section 18 records additional information data which is various kinds of data regarding a video stream and video data. The video data recording section 18 stores scene boundary information outputted from the scene detection section 16 and an intensity value calculated with regard to a scene.

A video display section 19 displays video data from the video data recording section 18 using a displaying method such as a thumbnail displaying method or a random accessing method based on various additional information data. This increases the degree of freedom in recognition of video data by the user and allows convenient display of video data.

A control section 20 controls a drive 21 to read out a controllingprogram stored on a magnetic disk 22, an optical disk 23, amagneto-optical disk 24 or a semiconductor memory 25 and controls thecomponents of the video-audio processing apparatus based on the thusread out controlling program.

The video-audio processing apparatus performs such a sequence of processes as generally illustrated in FIG. 5 to detect a scene.

Referring to FIG. 5, the video-audio processing apparatus first performs video division in step S1. In particular, the apparatus divides video data inputted to the video division section 11 into video segments or audio segments or, if possible, into both video and audio segments.

No particular prior condition is imposed on the video dividing method applied by the video-audio processing apparatus. For example, the video-audio processing apparatus may perform video division using such a method as disclosed in G. Ahanger and T. D. C. Little, “A survey of technologies for parsing and indexing digital video”, J. of Visual Communication and Image Representation 7:28-4, 1996. Such a video dividing method is well known in the art, and the video-audio processing apparatus may use any video dividing method.

Then in step S2, the video-audio processing apparatus performs extraction of feature amounts. In particular, the video-audio processing apparatus calculates feature amounts representative of features of the segment by means of the video feature amount extraction section 13 and the audio feature amount extraction section 14. The video-audio processing apparatus here calculates, for example, a time length of each segment, a video feature amount such as a color histogram or a texture feature, a frequency analysis result, an audio feature amount such as a level or a pitch, an activity measurement result and so forth as applicable feature amounts. Naturally, the feature amounts applicable to the video-audio processing apparatus are not limited to those specifically listed above.

Then in step S3, the video-audio processing apparatus performs similarity measurement between segments using the feature amounts. In particular, the video-audio processing apparatus performs dissimilarity measurement by means of the feature amount similarity measurement section 17 and measures, based on a measurement criterion, to which degree each segment is similar to neighboring segments. The video-audio processing apparatus uses the feature amounts extracted in step S2 to calculate the dissimilarity measurement criterion.

Then in step S4, the video-audio processing apparatus discriminates whether or not the segment corresponds to a break of a scene. In particular, the video-audio processing apparatus uses the dissimilarity measurement criterion calculated in step S3 and the feature amounts calculated in step S2 to determine, regarding each segment as the reference segment at present, in which one of the past and the future with respect to the reference segment the ratio of presence of neighboring similar segments is higher, and investigates the pattern of the change of the ratio to discriminate whether or not the reference segment corresponds to a boundary of a scene. The video-audio processing apparatus thus finally outputs whether or not each segment is a break of a scene.

The video-audio processing apparatus can detect a scene from the video data through such a sequence of processes as described above.

Accordingly, the user can use a result of the detection to summarize contents of the video data or access an interesting point in the video data rapidly.

Now, the sequence of processes described above is described in more detail for the individual steps.

The video division in step S1 is described first. The video-audio processing apparatus divides video data inputted to the video division section 11 into video segments or audio segments or, if possible, into video and audio segments. Here, a number of techniques are available for automatically detecting a boundary of a segment of video data, and in the video-audio processing apparatus, no particular prior condition is imposed on the video dividing method as described hereinabove.

On the other hand, in the video-audio processing apparatus, the accuracy in scene detection by later processing essentially relies upon the accuracy in video division. It is to be noted that scene detection by the video-audio processing apparatus can tolerate some errors in video division. Particularly, in the video-audio processing apparatus, video division is preferably performed with excessive segment detection rather than insufficient segment detection. As far as detection of similar segments is performed excessively, generally segments obtained as a result of excessive detection can be collected as the same scene upon scene detection.

Now, the feature amount detection in step S2 is described. A feature amount is an attribute of a segment which represents a feature of the segment and provides data for measurement of a similarity between different segments. The video-audio processing apparatus calculates feature amounts of each segment by means of the video feature amount extraction section 13 and/or the audio feature amount extraction section 14 to represent features of the segment.

Although the video-audio processing apparatus does not rely upon particulars of any feature amount, the feature amounts which are considered to be effective for use with the video-audio processing apparatus may be, for example, video feature amounts, audio feature amounts and video-audio common feature amounts described below. The requirement for such feature amounts which can be applied to the video-audio processing apparatus is that they allow measurement of dissimilarity. Further, in order to assure a high efficiency, the video-audio processing apparatus sometimes performs the feature amount extraction and the video division described above simultaneously. The feature amounts described below allow such processing as just described.

The feature amounts described above include feature amounts which relate to videos. In the following description, the feature amounts which relate to videos are referred to as video feature amounts. Since a video segment is formed from successive video frames, by extracting an appropriate video frame from within a video segment, contents represented by the video segment can be characterized with the extracted video frame. In particular, the similarity of a video segment can be replaced with the similarity of a video frame extracted appropriately. In short, a video feature amount is one of the important feature amounts which can be used by the video-audio processing apparatus. The video feature amount by itself in this instance can merely represent static information. However, the video-audio processing apparatus extracts a dynamic feature of a video segment based on the video feature amount by applying such a method as hereinafter described.

Although a large number of video feature amounts are known, it has been found that a color feature amount (histogram) and a video correlation provide a good balance between the calculation cost and the accuracy of scene detection. The video-audio processing apparatus therefore uses the color feature amount and the video correlation as the video features.

In the video-audio processing apparatus, a color of a video is an important material for discrimination of whether or not two videos are similar to each other. Use of a color histogram for discrimination of the similarity between videos is well known in the art and disclosed, for example, in G. Ahanger and T. D. C. Little, “A survey of technologies for parsing and indexing digital video”, J. of Visual Communication and Image Representation 7:28-4, 1996.

A color histogram is prepared by dividing a three-dimensional color space of, for example, LUV, RGB or the like into n regions and calculating relative ratios of frequencies of appearance of pixels of a video in the individual regions. Then, from the information obtained, an n-dimensional vector is given. From compressed video data, a color histogram can be extracted directly as disclosed, for example, in U.S. Pat. No. 5,708,767.

The video-audio processing apparatus thus obtains a histogram vector of an original YUV color space of a video (of a commonly used system such as MPEG-1/2 or DV) which composes a segment.

Specifically, the video-audio processing apparatus obtains a 2^(2·3) = 64-dimensional histogram vector by sampling, with 2 bits per color channel, the original YUV color space of a video (of a commonly used system such as MPEG-1/2 or DV) which composes a segment.
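As an illustration only, the 64-bin quantization just described can be sketched in Python as follows. The function name `color_histogram`, the representation of a frame as a list of 8-bit (Y, U, V) tuples, and the choice of keeping the two most significant bits of each channel are assumptions of this sketch and are not taken from the embodiment.

```python
def color_histogram(pixels, bits_per_channel=2):
    """Return relative bin frequencies for a coarsely quantized 3-channel color space.

    pixels: iterable of (y, u, v) tuples with 8-bit channel values (assumed layout).
    """
    levels = 1 << bits_per_channel      # 4 levels per channel when 2 bits are kept
    shift = 8 - bits_per_channel        # keep only the most significant bits
    hist = [0.0] * (levels ** 3)        # 4 * 4 * 4 = 64 bins
    count = 0
    for y, u, v in pixels:
        index = ((y >> shift) * levels + (u >> shift)) * levels + (v >> shift)
        hist[index] += 1.0
        count += 1
    if count:
        hist = [h / count for h in hist]  # relative ratios of appearance
    return hist


frame = [(0, 0, 0), (255, 255, 255)]
histogram = color_histogram(frame)
```

The resulting list can be treated as the n-dimensional vector (here n = 64) on which a dissimilarity measurement criterion is later evaluated.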

Such a histogram as described above represents a general color tone of the video, but does not include time information. Therefore, the video-audio processing apparatus uses the video correlation as another video feature amount. In scene detection by the video-audio processing apparatus, a structure of a plurality of similar segments which intersect with each other is a convincing index that it is a single united scene structure.

For example, in a scene of conversation, the target of the camera alternately moves between two talking persons, and when the camera takes the same talking person next, it is directed back to substantially the same position. It has been found that, in order to detect a structure in such a case, a correlation based on reduced gray scale videos makes a good index to the similarity of a segment. Therefore, the video-audio processing apparatus reduces an original video to a gray scale video of the size of M×N by sub-sampling and uses the gray scale video to calculate a video correlation. Here, M and N may be sufficiently low values, for example, 8×8. In short, such reduced gray scale videos are interpreted as MN-dimensional feature amount vectors.
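The reduction to an MN-dimensional vector can be sketched as follows. This is a minimal illustration assuming that a frame is already available as a two-dimensional list of luminance values and that plain nearest-sample sub-sampling is acceptable; the function name `reduced_gray_vector` is hypothetical.

```python
def reduced_gray_vector(gray, m=8, n=8):
    """Sub-sample a gray scale frame to M x N samples and flatten it to a vector.

    gray: list of rows, each row a list of luminance values (assumed layout).
    """
    height, width = len(gray), len(gray[0])
    vector = []
    for r in range(m):
        src_r = r * height // m          # source row for this output row
        for c in range(n):
            src_c = c * width // n       # source column for this output column
            vector.append(gray[src_r][src_c])
    return vector


# A synthetic 16 x 16 frame whose value encodes its own position.
frame = [[row * 16 + col for col in range(16)] for row in range(16)]
vector = reduced_gray_vector(frame)
```

Two such vectors can then be compared with the same dissimilarity measurement criterion that is applied to histograms.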

Feature amounts regarding an audio are feature amounts different from the video feature amounts described above. In the following description, such feature amounts are referred to as audio feature amounts. An audio feature amount is a feature amount which can represent contents of an audio segment, and the video-audio processing apparatus can use a frequency analysis, a pitch, a level or the like as such an audio feature amount. Such audio feature amounts are known from various documents.

The video-audio processing apparatus can perform frequency analysis such as fast Fourier transform to determine the distribution of frequency information of a single audio frame. In order to represent the distribution of frequency information, for example, over an audio segment, the video-audio processing apparatus can use FFT (Fast Fourier Transform) components, a frequency histogram, a power spectrum, a cepstrum or some other feature amount.

Further, the video-audio processing apparatus can also use a pitch such as an average pitch or a maximum pitch or an audio level such as an average loudness or a maximum loudness as an effective audio feature amount for representing an audio segment.

Furthermore, a video-audio common feature amount is listed as another feature amount. Although the video-audio common feature amount is neither a video feature amount nor an audio feature amount, it provides information useful for the video-audio processing apparatus to represent a feature of a segment in a scene. The video-audio processing apparatus uses a segment length and an activity as such video-audio common feature amounts.

The video-audio processing apparatus can use the segment length as a video-audio common feature amount. The segment length is a time length of a segment. Generally, a scene has a rhythm feature unique to the scene. The rhythm feature appears as a variation of the segment length in the scene, and, for example, a rapid succession of short segments represents a commercial message. Meanwhile, segments in a scene of conversation are longer than those of a commercial message, and a scene of conversation has a characteristic that segments combined with each other are similar to each other. The video-audio processing apparatus can use a segment length having such characteristics as just described as a video-audio common feature amount.

Further, the video-audio processing apparatus can use an activity as a video-audio common feature amount. The activity is an index representative of to what degree contents of a segment are felt to be dynamic or static. For example, where contents of a segment are visually dynamic, the activity represents a degree with which the camera moves rapidly along the subject or with which the object being imaged changes rapidly.

The activity is calculated indirectly by measuring an average value of inter-frame dissimilarities of such feature amounts as a color histogram. Here, where the dissimilarity measurement criterion for the feature amount F measured between a frame i and another frame j is d_(F)(i, j), the video activity V_(F) is defined by the following expression (1):

$\begin{matrix}{V_{F} = \frac{\sum\limits_{i = b}^{f - 1}{d_{F}\left( {i,{i + 1}} \right)}}{f - b + 1}} & (1)\end{matrix}$

where b and f are the frame numbers of the first and last frames of one segment, respectively. The video-audio processing apparatus particularly uses, for example, a histogram described above to calculate the activity V_(F).
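A minimal Python sketch of expression (1) is given below for illustration. It assumes that each frame of the segment has already been reduced to a feature vector such as a histogram and that the L1 distance serves as d_(F); the function names are hypothetical.

```python
def l1_distance(a, b):
    """Sum of absolute element-wise differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def video_activity(frame_features, distance=l1_distance):
    """Activity V_F of one segment per expression (1).

    frame_features: feature vectors of frames b..f in time order,
    so that len(frame_features) equals f - b + 1.
    """
    count = len(frame_features)
    if count == 0:
        return 0.0
    total = sum(distance(frame_features[i], frame_features[i + 1])
                for i in range(count - 1))   # i runs from b to f - 1
    return total / count                     # denominator f - b + 1


static_segment = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
dynamic_segment = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

A segment whose frames do not change yields an activity of zero, whereas the second example accumulates one inter-frame change of magnitude 2 over three frames.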

While the feature amounts described above including the video feature amounts basically represent static information of a segment, in order to represent features of a segment accurately, dynamic information must also be taken into consideration. Therefore, the video-audio processing apparatus represents dynamic information using such a sampling method of feature amounts as described below.

The video-audio processing apparatus extracts more than one static feature amount from different points of time within one segment, for example, as seen from FIG. 5. In this instance, the video-audio processing apparatus determines the extraction number of feature amounts by balancing maximization of the fidelity and minimization of the data redundancy in the segment representation. For example, where a certain one image in a segment can be designated as a key frame of the segment, a histogram calculated from the key frame is used as sample feature amounts to be extracted.

The video-audio processing apparatus uses a sampling method, which is hereinafter described, to determine which one of those samples which can be extracted as a feature should be selected from within the object segment.

Here, a case wherein a certain sample is selected normally at a predetermined point of time, for example, at the last point of time in a segment, is considered. In this instance, there is the possibility that, from two arbitrary segments which are changing (fading) to a dark frame, the resulting feature amounts may be the same as each other because the samples are the same dark frame. In other words, whatever the video contents of the segments are, the two selected frames are determined to be very similar to each other. Such a problem occurs because the samples are not good representative values.

Therefore, the video-audio processing apparatus does not extract a feature amount at such a fixed point as described above but extracts a statistically representative value of an entire segment. Here, a popular feature amount sampling method is described in connection with two cases including a first case wherein feature amounts can be represented as an n-dimensional vector of real numbers and a second case wherein only the dissimilarity measurement criterion can be applied. It is to be noted that, in the first case, very well known video feature amounts and audio feature amounts such as a histogram and a power spectrum are involved.

In the first case, the sample number is determined to be k in advance, and the video-audio processing apparatus uses a well-known k-means clustering method disclosed in L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Analysis”, John-Wiley and sons, 1990 to automatically divide the feature amounts regarding the entire segment into k groups. Then, the video-audio processing apparatus selects, from each of the k groups, a sample whose sample value is equal or proximate to a centroid of the group. The complexity of the processing by the video-audio processing apparatus increases merely linearly in proportion to the sample number.

Meanwhile, in the second case, the video-audio processing apparatus uses a k-medoids algorithm disclosed in L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Analysis”, John-Wiley and sons, 1990 to form k groups of samples. Then, the video-audio processing apparatus uses, as a sample value for each of the k groups, a medoid of the group.
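The second case, in which only a dissimilarity measurement criterion is available, can be sketched as follows. This is a simplified alternating k-medoids iteration written for illustration and is not the exact procedure of the cited reference; the naive initialization with the first k items and the function name `k_medoids` are assumptions of the sketch.

```python
def k_medoids(items, k, distance, max_iter=100):
    """Pick k representative items using only a pairwise dissimilarity function."""
    medoids = list(range(min(k, len(items))))   # naive start: the first k items
    for _ in range(max_iter):
        # Assign every item to its nearest current medoid.
        clusters = {m: [] for m in medoids}
        for idx, item in enumerate(items):
            nearest = min(medoids, key=lambda m: distance(item, items[m]))
            clusters[nearest].append(idx)
        # In each group, choose the member with the least total dissimilarity.
        new_medoids = []
        for m, members in clusters.items():
            if not members:                     # keep a medoid that attracted nothing
                new_medoids.append(m)
                continue
            best = min(members,
                       key=lambda c: sum(distance(items[c], items[o])
                                         for o in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break                               # assignment is stable
        medoids = new_medoids
    return [items[m] for m in sorted(medoids)]


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


samples = [[0], [1], [2], [10], [11], [12]]
representatives = k_medoids(samples, 2, l1_distance)
```

Because each representative is an actual sample of the segment, the same routine can also serve the first case when a sample proximate to a centroid is wanted.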

It is to be noted that, in the video-audio processing apparatus, the method of forming a dissimilarity measurement criterion for a feature amount representative of an extracted dynamic feature is based on the dissimilarity measurement criterion for the static feature amount on which the dynamic feature amount is based. This, however, is hereinafter described.

In this manner, the video-audio processing apparatus can extract a plurality of static feature amounts and can use a plurality of such static feature amounts to represent a dynamic feature amount.

As described above, the video-audio processing apparatus can extract various feature amounts. Generally, each of such feature amounts is in most cases insufficient to solely represent a feature of a segment. Therefore, the video-audio processing apparatus can combine the feature amounts suitably to select a set of feature amounts which make up for each other. For example, by combining a color histogram and a video correlation described above, the video-audio processing apparatus can obtain more information than the information each feature amount has.

Now, the similarity measurement between segments which uses feature amounts in step S3 of FIG. 5 is described. The video-audio processing apparatus uses the dissimilarity measurement criterion, which is a function for calculation of a real value to measure to which degree two feature amounts are not similar to each other, to perform similarity measurement of segments by means of the feature amount similarity measurement section 17. The dissimilarity measurement criterion indicates that, when the value thereof is low, the two feature amounts are similar to each other, but when the value thereof is high, the two feature amounts are not similar to each other. Here, a function for calculation of the dissimilarity of two segments S₁ and S₂ regarding the feature amount F is defined as a dissimilarity measurement criterion d_(F)(S₁, S₂). It is to be noted that this function needs to satisfy the relationships given by the following expression (2):

$\begin{matrix}{\begin{matrix}{{d_{F}\left( {S_{1},S_{2}} \right)} = 0} & \left( {{when}\mspace{14mu} S_{1} = S_{2}} \right) \\{{d_{F}\left( {S_{1},S_{2}} \right)} \geq 0} & \left( {{for}\mspace{14mu} {all}\mspace{14mu} S_{1},S_{2}} \right) \\{{d_{F}\left( {S_{1},S_{2}} \right)} = {d_{F}\left( {S_{2},S_{1}} \right)}} & \left( {{for}\mspace{14mu} {all}\mspace{14mu} S_{1},S_{2}} \right)\end{matrix}} & (2)\end{matrix}$

Although some dissimilarity measurement criteria can be applied only to a certain feature amount, generally most dissimilarity measurement criteria can be applied to measurement of the similarity regarding a feature amount represented as a point in an n-dimensional space as disclosed in G. Ahanger and T. D. C. Little, “A survey of technologies for parsing and indexing digital video”, J. of Visual Communication and Image Representation 7:28-4, 1996 or in L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Analysis”, John-Wiley and sons, 1990.

The Euclidean distance, the inner product, and the L1 distance are particular examples. Here, since particularly the L1 distance acts effectively upon various feature amounts including such feature amounts as a histogram or a video correlation, the video-audio processing apparatus uses the L1 distance. Here, where two n-dimensional vectors are represented by A and B, the L1 distance d_(L1)(A, B) between A and B is given by the following expression (3):

$\begin{matrix}{{d_{L1}\left( {A,B} \right)} = {\sum\limits_{i = 1}^{n}\left| {A_{i} - B_{i}} \right|}} & (3)\end{matrix}$

where the subscript i indicates the i-th elements of the n-dimensional vectors A and B.
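Expression (3) translates directly into a few lines of Python, shown here for illustration. The function name `l1_distance` and the explicit length check are additions of this sketch.

```python
def l1_distance(a, b):
    """L1 distance between two n-dimensional vectors, per expression (3)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))


a = [1, 2, 3]
b = [3, 2, 0]
distance_ab = l1_distance(a, b)   # |1-3| + |2-2| + |3-0|
```

The function returns zero for identical vectors, is never negative, and is symmetric in its arguments, so it satisfies the three relationships required of a dissimilarity measurement criterion.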

Further, as described hereinabove, the video-audio processing apparatus extracts static feature amounts at various points of time in segments as feature amounts representative of dynamic features. Then, in order to determine a similarity between two extracted dynamic feature amounts, a dissimilarity measurement criterion between the static feature amounts on which the dynamic feature amounts are based is used as a dissimilarity measurement reference for the similarity. Such dissimilarity measurement criteria for dynamic feature amounts are in most cases determined best using a dissimilarity value between the most similar pair of static feature amounts selected from the dynamic feature amounts. In this instance, the dissimilarity measurement criterion between two extracted dynamic feature amounts SF₁ and SF₂ is defined as given by the following expression (4):

$\begin{matrix}{{d_{s}\left( {{SF}_{1},{SF}_{2}} \right)} = {\min\limits_{{F_{1} \in {SF}_{1}},{F_{2} \in {SF}_{2}}}{d_{F}\left( {F_{1},F_{2}} \right)}}} & (4)\end{matrix}$

where the function d_(F)(F₁, F₂) indicates the dissimilarity measurement criterion regarding the static feature amount F on which the dynamic feature amounts SF₁ and SF₂ are based. It is to be noted that, according to circumstances, not the lowest value of the dissimilarity of a feature amount but the highest value or an average value may be used.
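Expression (4) can be illustrated by the following sketch, in which a dynamic feature amount is assumed to be a list of static feature vectors sampled from one segment. The `reduce` parameter is a hypothetical hook for substituting the highest or an average value as mentioned above.

```python
def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def dynamic_dissimilarity(sf1, sf2, static_distance=l1_distance, reduce=min):
    """Dissimilarity between two sets of static samples, per expression (4).

    With reduce=min, the value of the most similar pair is returned.
    """
    pairwise = [static_distance(f1, f2) for f1 in sf1 for f2 in sf2]
    return reduce(pairwise)


def mean(values):
    return sum(values) / len(values)


sf1 = [[0, 0], [5, 5]]
sf2 = [[4, 4], [9, 9]]
lowest = dynamic_dissimilarity(sf1, sf2)                 # pair (5,5)-(4,4)
highest = dynamic_dissimilarity(sf1, sf2, reduce=max)    # pair (0,0)-(9,9)
average = dynamic_dissimilarity(sf1, sf2, reduce=mean)
```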

In order for the video-audio processing apparatus to determine the dissimilarity between segments, it is sometimes insufficient to use a single feature amount and thus necessary to combine information from a large number of feature amounts regarding the same segment. As one of such methods, the video-audio processing apparatus calculates the dissimilarity based on various feature amounts as a weighted combination of the feature amounts. In particular, where k feature amounts F₁, F₂, . . . , F_(k) are involved, the video-audio processing apparatus uses a dissimilarity measurement criterion d_(F)(S₁, S₂) regarding combined feature amounts represented by the following expression (5):

$\begin{matrix}{{d_{F}\left( {S_{1},S_{2}} \right)} = {\sum\limits_{i = 1}^{k}{w_{i}{d_{F_{i}}\left( {S_{1},S_{2}} \right)}}}} & (5)\end{matrix}$

where w_(i) is the weighting coefficient, which satisfies $\sum_{i}w_{i} = 1$.
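The weighted combination of expression (5) might be sketched as follows. The representation of a segment as a dictionary of named feature vectors, the feature names, and the weights 0.75 and 0.25 are all hypothetical choices made only for this example.

```python
import math


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def combined_dissimilarity(seg1, seg2, weighted_criteria):
    """Weighted sum of per-feature dissimilarities, per expression (5).

    weighted_criteria: list of (weight, function) pairs; the weights must sum to 1.
    """
    total_weight = sum(w for w, _ in weighted_criteria)
    if not math.isclose(total_weight, 1.0):
        raise ValueError("weighting coefficients must sum to 1")
    return sum(w * d(seg1, seg2) for w, d in weighted_criteria)


# Hypothetical per-feature criteria operating on dictionary-style segments.
def histogram_criterion(s1, s2):
    return l1_distance(s1["histogram"], s2["histogram"])


def correlation_criterion(s1, s2):
    return l1_distance(s1["gray"], s2["gray"])


segment_a = {"histogram": [1, 0, 0], "gray": [10, 20]}
segment_b = {"histogram": [0, 0, 3], "gray": [14, 16]}
criteria = [(0.75, histogram_criterion), (0.25, correlation_criterion)]
combined = combined_dissimilarity(segment_a, segment_b, criteria)
```

Here the histogram dissimilarity is 4 and the gray scale dissimilarity is 8, so the combined value is 0.75 × 4 + 0.25 × 8.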

The video-audio processing apparatus can use the feature amounts extracted in step S2 of FIG. 5 to calculate a dissimilarity measurement criterion to measure the similarity between the segments in such a manner as described above.

Now, the cutting of a scene in step S4 of FIG. 5 is described. The video-audio processing apparatus uses the dissimilarity measurement criterion and the extracted feature amounts to detect a variation of the distribution pattern of neighboring, most similar segments to each segment to discriminate whether or not the segment is at a break of a scene, and outputs a result of the discrimination. The video-audio processing apparatus performs the following four processes to detect a scene.

In the process (1), when each segment is determined as a reference, a fixed number of most similar segments within a fixed time frame are detected.

In the process (2), after the process (1), the ratio in number of similar segments which are present in the past and in the future with respect to the reference segment is calculated (actually the number of similar segments present in the past is subtracted from the number of similar segments present in the future or the like), and a result of the calculation is determined as a boundary likelihood measurement value.

In the process (3), a variation with respect to time of the boundary likelihood measurement values obtained by the process (2) when each segment is determined as a reference is examined to detect a segment position which indicates a pattern wherein several segments having a high ratio in the past successively appear first and then several segments having a high ratio in the future successively appear.

In the process (4), the absolute values of the boundary likelihood measurement values in the process (3) are totaled, and the total value is called the scene intensity value. If the scene intensity value exceeds a predetermined threshold value, then the segment is determined as a boundary of a scene.

The processes are described more specifically in order with reference to FIGS. 6A to 6E. In the process (1), for example, as shown in FIG. 6A, a time frame including arbitrary k segments in the past and k segments in the future is set for each segment (in the example shown in FIG. 6A, five segments), and N similar segments are detected from within the time frame (in FIG. 6A, four segments). The time advances to the future as the number which represents each segment increases. The central segment 7 in FIG. 6A indicated by slanting lines is a reference segment at a certain point of time, and similar segments to the reference segment are the segments 4, 6, 9 and 11 indicated by reversely slanting lines. Here, four similar segments are extracted, and two similar segments are present in the past while two similar segments are present in the future.
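The process (1) can be sketched as a search over the ±k window around the reference segment. The sketch below is illustrative only: it assumes one scalar or vector feature per segment, a caller-supplied dissimilarity function, and ties broken in favor of the earlier segment; the function name `most_similar_neighbors` is hypothetical.

```python
def most_similar_neighbors(index, features, k, n, distance):
    """Return the indices of the N most similar segments within +/- k of `index`."""
    lo = max(0, index - k)
    hi = min(len(features) - 1, index + k)
    candidates = [j for j in range(lo, hi + 1) if j != index]
    # A stable sort keeps earlier segments first when dissimilarities are equal.
    candidates.sort(key=lambda j: distance(features[index], features[j]))
    return sorted(candidates[:n])


def scalar_distance(a, b):
    return abs(a - b)


# Two synthetic scenes: segments 0-3 share one feature value, segments 4-7 another.
features = [0, 0, 0, 0, 10, 10, 10, 10]
before_boundary = most_similar_neighbors(3, features, 3, 2, scalar_distance)
after_boundary = most_similar_neighbors(4, features, 3, 2, scalar_distance)
```

For the last segment of the first synthetic scene all selected neighbors lie in the past, and for the first segment of the second scene they all lie in the future.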

In the process (2), the boundary likelihood measurement value is calculated by dividing the number in the past by the number in the future or by subtracting the number in the past from the number in the future. Here, the boundary likelihood measurement value is calculated by the latter method. Each boundary likelihood measurement value is represented by F_(i), where i represents the position (number) of each segment. By calculation according to the latter method, the boundary likelihood measurement value F₇ of the reference segment 7 of FIG. 6A is F₇=2−2=0.

In the process (3), the calculation in the process (2) is successively performed along the time axis. In FIG. 6B, it can be seen that, with reference to the segment 10 when the reference segment advances by 3 segments from that in FIG. 6A, three similar segments 5, 8 and 9 are present in the past while one similar segment 11 is present in the future. The boundary likelihood measurement value F₁₀ then is F₁₀=1−3=−2.

FIG. 6C illustrates a state when the reference segment further advances by one segment to a position immediately prior to a boundary of the scene. In the state illustrated, similar segments 6, 7, 9 and 10 to the reference segment 11 are all concentrated in the past. The boundary likelihood measurement value F₁₁ then is F₁₁=0−4=−4.

FIG. 6D illustrates a state when the reference segment advances by one segment from that of FIG. 6C, immediately after the reference segment passes the boundary and enters a new scene and thus comes to the segment 12 at the top of the scene. Similar segments are segments 13, 14, 15 and 16. Thus, the pattern in this instance has changed to a pattern wherein all of the similar segments are present in the future. The boundary likelihood measurement value F₁₂ then is F₁₂=4−0=4.

Finally, FIG. 6E illustrates a state when the reference segment further advances by one segment to the segment 13. Similarly, the boundary likelihood measurement value F₁₃ then is F₁₃=3−1=2. According to the present method, when the ratio of similar segments in the past is higher, the sign is negative (minus sign) in this manner, and the positive sign (plus sign) indicates that the ratio is higher in the future. The variation of the boundary likelihood measurement value F_(i) then indicates such a pattern as

0 . . . −2 → −4 → +4 → +2   (6)
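The values in the walk through FIGS. 6A to 6D can be reproduced with a short sketch of the subtraction method. The function name `boundary_likelihood` is hypothetical; the similar-segment lists are those stated for FIGS. 6A to 6D, and FIG. 6E is omitted because its similar segments are not enumerated.

```python
def boundary_likelihood(reference, similar_segments):
    """Number of similar segments in the future minus the number in the past."""
    past = sum(1 for s in similar_segments if s < reference)
    future = sum(1 for s in similar_segments if s > reference)
    return future - past


# Reference segment -> similar segments, as described for FIGS. 6A to 6D.
examples = {
    7: [4, 6, 9, 11],        # FIG. 6A
    10: [5, 8, 9, 11],       # FIG. 6B
    11: [6, 7, 9, 10],       # FIG. 6C
    12: [13, 14, 15, 16],    # FIG. 6D
}
values = {ref: boundary_likelihood(ref, sim) for ref, sim in examples.items()}
```

The sign change between the reference segments 11 and 12 is the feature that the subsequent processes look for.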

The position at which the change from −4 to +4 is exhibited corresponds to the boundary between the scenes. This reflects the following pattern of similar segments. Where the reference segment, and hence the time frame, is positioned in the middle of a scene as seen in FIG. 6A, similar segments in the time frame are present substantially uniformly in the past and in the future across the reference segment. As the reference segment approaches a boundary of the scene, the ratio of similar segments present in the past rises as seen in FIG. 6B until it comes to 100% in FIG. 6C, whereafter the ratio of similar segments present in the future changes to 100% immediately after the reference segment passes the boundary as seen in FIG. 6D. By detecting such a pattern, a changing point at which the ratio of similar segments changes from substantially 100% of those present in the past to substantially 100% of those present in the future can be determined as a break of a scene.

Even in a non-boundary area of a scene, the ratio of similar segments sometimes exhibits a temporary change from a high ratio of similar segments in the past to a high ratio of similar segments in the future (for only one segment period). In most cases, however, this is not a boundary of a scene. This is because, in almost all cases, such a temporary change occurs accidentally. When a pattern is detected wherein a plurality of boundary likelihood measurement values which indicate that the ratio of similar segments present in the past is high as in a non-boundary area successively appear first and then a plurality of boundary likelihood measurement values which indicate that the ratio of similar segments present in the future is high successively appear, it is discriminated that the reference segment is a boundary of a scene with a high degree of possibility. In any other case, the reference segment is not a boundary of a scene with a high possibility, and therefore, it is not determined as a boundary of a scene.

In the process (4), after the process (3), the boundary likelihood measurement values are totaled to calculate the “intensity” of the scene boundary point. In order to measure the intensity, the absolute values of the boundary likelihood measurement values are added. The degree of the variation of the value of the intensity corresponds to the degree of the visual variation between the scenes, and the degree of the visual variation between the scenes corresponds to the degree of the significance variation. Accordingly, a scene corresponding to the magnitude of the significance degree of a scene can be detected depending upon the magnitude of the value.

Here, the total value of the absolute values is defined as the scene intensity value V_(i). In the definition, i represents the number of the segment. For example, the total value of the absolute values of four boundary likelihood measurement values (for each segment, the boundary likelihood measurement values F_(i−2), F_(i−1), F_(i), F_(i+1) of four segments including two segments in the past, one segment in the future and the segment itself) is used.

It is considered that, in the pattern of the variation of the boundary likelihood measurement value at a boundary of a scene, a variation occurs from a case wherein similar segments are present by 100% in the past to another case wherein similar segments are present by 100% in the future like the value −4 of F_(i−1) → value +4 of F_(i) as given hereinabove.

In this manner, a great change occurs in a one-segment distance on the boundary between scenes. Then, the possibility that a variation in pattern may occur while the absolute value of the boundary likelihood measurement value remains high over four or more segments like the pattern of the expression (6) is not high except in the proximity of a boundary of a scene. From the characteristic of the variation in pattern, a desired scene can be detected by discriminating only a place at which the scene intensity value V_(i) is equal to or higher than a certain level as an actual boundary of a scene.
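The scene intensity value and the thresholding can be illustrated with the following sketch, which totals the absolute values of F_(i−2), F_(i−1), F_(i) and F_(i+1). The clamping of the window at the ends of the sequence and the function names are assumptions of the sketch; the threshold is a free parameter.

```python
def scene_intensity(f, i, past=2, future=1):
    """Sum of |F| over F[i-past] .. F[i+future], clamped to the available values."""
    lo = max(0, i - past)
    return sum(abs(v) for v in f[lo:i + future + 1])


def boundaries_above(f, threshold):
    """Indices whose scene intensity value is equal to or higher than the threshold."""
    return [i for i in range(len(f)) if scene_intensity(f, i) >= threshold]


# Boundary likelihood values following the pattern of expression (6).
likelihoods = [0, -2, -4, 4, 2]
intensity_at_boundary = scene_intensity(likelihoods, 3)   # 2 + 4 + 4 + 2
detected = boundaries_above(likelihoods, 12)
```

With the pattern of expression (6), only the position at which the value changes from −4 to +4 reaches an intensity of 12; its neighbors total 10.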

FIG. 7 illustrates a graph of a result of use of video data forapproximately 30 minutes of an actual music program. The axis ofordinate represents the scene intensity value, and the axis of abscissarepresents segments. Each segment represented by a bar with slantinglines is an actual boundary of a scene (here, the segment is the topsegment of a scene). In the result illustrated, if a segment at whichthe scene intensity value is equal to or higher than 12 is determined asa boundary of a scene, then the scenes coincide with actual scenes withthe probability of 6/7.

The flow of operations described above is described with reference to FIG. 8. The flow of operations described here is performed by the scene detection section 16 of the video-audio processing apparatus, and the following processing is performed each time a segment is produced.

In step S11, the video-audio processing apparatus detects, for each segment, N neighboring similar segments within a range of ±k segments centered at the segment using the feature amount similarity measurement section 17, and determines the number of those similar segments which are present in the past and the number of those similar segments which are present in the future.

In step S12, the value obtained by subtracting the number of those similar segments of the N similar segments determined by the processing in step S11 which are present in the past from the number of those similar segments which are present in the future is determined as the boundary likelihood measurement value F_(i) for each segment, and the boundary likelihood measurement values F_(i) determined in this manner are stored.
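Steps S11 and S12 can be illustrated with a minimal Python sketch. It is not the implementation of the scene detection section 16; the function name `boundary_likelihood`, the parameter names, and the use of a caller-supplied `dissimilarity` function over per-segment feature amounts are assumptions made only for illustration.

```python
def boundary_likelihood(features, i, k, n_similar, dissimilarity):
    """Return F_i = (similar segments in the future) - (similar segments in the past).

    features      : list of per-segment feature amounts (hypothetical representation)
    i             : index of the reference segment
    k             : half-width of the search range (+/-k segments)
    n_similar     : N, the number of most similar neighbours to keep
    dissimilarity : callable(a, b) -> float, smaller means more similar (assumed)
    """
    lo = max(0, i - k)
    hi = min(len(features) - 1, i + k)
    # Step S11: candidate neighbours within +/-k segments, excluding segment i itself.
    candidates = [j for j in range(lo, hi + 1) if j != i]
    # Keep the N candidates with the smallest dissimilarity to the reference segment.
    nearest = sorted(
        candidates, key=lambda j: dissimilarity(features[i], features[j])
    )[:n_similar]
    # Step S12: future count minus past count.
    future = sum(1 for j in nearest if j > i)
    past = sum(1 for j in nearest if j < i)
    return future - past


if __name__ == "__main__":
    # Two artificial "scenes" of five segments each, with scalar feature amounts.
    feats = [0.0] * 5 + [10.0] * 5
    dist = lambda a, b: abs(a - b)
    print([boundary_likelihood(feats, i, 4, 4, dist) for i in (4, 5)])  # [-4, 4]
```

With N = 4, the last segment of the first artificial scene yields −4 and the first segment of the second yields +4, which matches the −4 → +4 transition described for a scene boundary.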

In step S13, a place which is a boundary of a scene with a high possibility is specified from a variation of the pattern of the boundary likelihood measurement values F_(i−n), . . . , F_(i), . . . , F_(i+n) of 2n segments. Here, n is the number of boundary likelihood measurement values sufficient to detect a pattern change between the ratio in the past and the ratio in the future around the segment i.

Here, three requirements for a variation pattern which suggests a boundary of a scene are defined in the following manner:

-   (1) None of the boundary likelihood measurement values F_(i−n) to F_(i+n) is equal to 0;
-   (2) The values of F_(i−n) to F_(i−1) are all lower than 0; and
-   (3) The values of F_(i) to F_(i+n) are all higher than 0.

Then, it is discriminated whether or not all of the three requirements given above are satisfied. If all of the requirements are satisfied, then it is discriminated that the place is a boundary of a scene with a high possibility, and the processing advances to next step S14. In any other case, the processing advances to step S16.
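The check of step S13 can be sketched in Python as follows. The sketch reads requirement (3) as covering F_(i) to F_(i+n), in line with the −4 of F_(i−1) → +4 of F_(i) transition described earlier; that reading, the function name `is_boundary_candidate`, and the choice to reject segments whose window runs off either end of the series are assumptions for illustration.

```python
def is_boundary_candidate(F, i, n):
    """Return True if the window F[i-n] .. F[i+n] satisfies the three requirements.

    F : list of stored boundary likelihood measurement values
    i : index of the reference segment
    n : number of values inspected on each side (hypothetical parameter name)
    """
    # Assumption: a segment without a complete window is never a candidate.
    if i - n < 0 or i + n >= len(F):
        return False
    window = F[i - n : i + n + 1]
    past = F[i - n : i]          # F_(i-n) .. F_(i-1)
    future = F[i : i + n + 1]    # F_(i)   .. F_(i+n)
    return (
        all(v != 0 for v in window)      # requirement (1)
        and all(v < 0 for v in past)     # requirement (2)
        and all(v > 0 for v in future)   # requirement (3), as read above
    )


if __name__ == "__main__":
    F = [-4, -4, -3, -4, 4, 3, 4, 4]
    print(is_boundary_candidate(F, 4, 2))  # True
    print(is_boundary_candidate(F, 3, 2))  # False
```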

In step S14, the boundary likelihood measurement values obtained in step S13 are applied to the following expression to calculate the scene intensity V_(i) from the boundary likelihood measurement values F_(i−n), . . . , F_(i), . . . , F_(i+n):

$$V_i = |F_{i-n}| + \cdots + |F_{i-1}| + |F_i| + \cdots + |F_{i+n}|$$
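As a small worked illustration of the expression for V_(i), the following Python sketch sums the absolute values over the window. The function name `scene_intensity` and the clipping of the window at the ends of the series are assumptions, not part of the described apparatus.

```python
def scene_intensity(F, i, n):
    """Return V_i = |F_(i-n)| + ... + |F_(i)| + ... + |F_(i+n)|.

    The window is clipped at the ends of the series (an assumption of this sketch).
    """
    lo = max(0, i - n)
    hi = min(len(F) - 1, i + n)
    return sum(abs(v) for v in F[lo : hi + 1])


if __name__ == "__main__":
    F = [-4, -4, -3, -4, 4, 3, 4, 4]
    # Window for i = 4, n = 2 is [-3, -4, 4, 3, 4].
    print(scene_intensity(F, 4, 2))  # 18
```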

Then, where a requirement that a threshold value for the intensity value must be exceeded is provided, if a scene intensity value which satisfies the requirement appears, it is determined that the value represents the intensity of the visual change of the scene being sought, and the position of the segment is outputted as one of the boundaries of scenes of the video data being processed. Where no requirement regarding the intensity value is imposed, the intensity value of each segment is outputted and recorded as additional information data into the video data recording section 18.

The processing described above is repeated to successively detect boundaries of scenes. A scene is formed from a group of segments included in a range from one to another one of the boundaries.

As described above, the video-audio processing apparatus to which the present invention is applied extracts a scene structure. It has already been proved through experiments that the series of processes of the video-audio processing apparatus described above can be applied to extract a scene structure from video data of various contents such as a television drama or a movie.

It is to be noted that, according to the present invention, the number of boundaries of scenes can be adjusted by arbitrarily changing the threshold applied to the scene intensity value. Therefore, by adjusting the scene intensity threshold value, boundary detection of a scene adapted better to various contents can be anticipated.

Further, in order to make it easy to look at videos at a glance, the number of scenes obtained can be made as small as possible. Where the number of detected scenes is limited in this way, a new problem arises of which scenes should be shown. If the significance of each of the obtained scenes is known, then it is desirable to show the scenes in the order of their significance. The present technique provides a scene intensity value, which is a scale for measuring the degree to which an obtained scene is significant, and thus allows the number of scenes to be changed by changing the scale (changing the scene intensity threshold value). Thus, the present invention provides a convenient representation for enjoyment in response to the interest of the user.

Besides, when the number of scenes is to be changed, it is not necessary to perform the scene detection process again; the stored intensity value time series can be processed simply by changing the scene intensity threshold value.
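The point that only the threshold needs to change can be illustrated with a short Python sketch over a stored intensity time series. The function name `boundaries_at_threshold` and the sample values are hypothetical; the comparison "equal to or higher than" follows the wording used for the FIG. 7 result.

```python
def boundaries_at_threshold(intensity, threshold):
    """Return the segment indices whose stored scene intensity is >= threshold.

    intensity : list of stored scene intensity values, one per segment
                (hypothetical storage format; 0 marks a non-candidate segment)
    """
    return [i for i, v in enumerate(intensity) if v >= threshold]


if __name__ == "__main__":
    stored = [0, 0, 14, 0, 9, 0, 16]  # made-up values, for illustration only
    print(boundaries_at_threshold(stored, 12))  # [2, 6]
    print(boundaries_at_threshold(stored, 8))   # [2, 4, 6]
```

Lowering the threshold from 12 to 8 yields a finer set of boundaries from the same stored series, with no feature extraction or similarity measurement repeated.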

As described above, the present invention solves all problems of the prior art described hereinabove.

First, according to the video-audio processing apparatus, the user need not know a significance structure of video data in advance.

Further, the processing performed for each segment by the video-audio processing apparatus includes the following items:

-   (1) To extract a feature amount;
-   (2) To measure a dissimilarity between a pair of segments in a time area which includes a fixed number of segments;
-   (3) To use a result of the dissimilarity measurement to extract a fixed number of sufficiently similar segments;
-   (4) To calculate a boundary likelihood measurement value from a ratio of presence of similar segments; and
-   (5) To use the boundary likelihood measurement value to determine an intensity value of a scene boundary point.

The processes described impose a low calculation load. Therefore, the processing can be applied to electronic apparatus for domestic use such as a set top box, a digital video recorder or a home server.

Further, the video-audio processing apparatus can provide, as a result of detection of a scene, a basis for a new high level access for video browsing. Therefore, the video-audio processing apparatus allows easy content-based access to video data by visualizing the contents of the video data using a high level video structure, that of a scene rather than of a segment. For example, where the video-audio processing apparatus displays a scene, the user can recognize a subject matter of the program rapidly and can find out a portion of the program which is interesting to the user.

Further, according to the video-audio processing apparatus, since a scene is detected, a basis for automatically producing an outline or an abstract of video data is obtained. Generally, in order to produce a consistent abstract, it is necessary not to combine random fractions from video data but to decompose video data into reproducible significant components. A scene detected by the video-audio processing apparatus provides a basis for production of such an abstract as just described.

It is to be noted that the present invention is not limited to the embodiment described above, and naturally, for example, the feature amounts for use for similarity measurement between segments and so forth may be different from those given hereinabove. Further, it is a matter of course that the embodiment described above can be modified suitably without departing from the spirit and scope of the present invention.

Furthermore, according to the present invention, a scene which is a significant changing point on a contents structure is obtained by arbitrarily changing the scene intensity threshold value. This is because the intensity value can correspond to the degree of the variation of contents. In particular, when a video is to be accessed, the number of detected scenes can be controlled by adjusting the scene intensity threshold value. Besides, it is possible to increase or decrease the number of scenes whose contents should be displayed, in accordance with an object.

In short, the so-called accessing granularity of contents can be controlled freely in accordance with an object. For example, when a video is to be enjoyed for a certain one hour, the intensity threshold value is set to a high value first to show a short abstract including a scene or scenes which are significant for the contents. Then, if the user is more interested and wants to see the contents in more detail, the intensity threshold value is lowered so that another abstract formed from a finer scene or scenes can be displayed. Besides, where the method of the present invention is applied, different from the prior art, detection need not be performed again each time the intensity threshold value is adjusted; it is only required to process the stored intensity value time series.

Further, where the video-audio processing apparatus is applied to domestic apparatus such as a set top box or a digital video recorder, the following advantages can be anticipated.

The first advantage is that, since scene detection of the present invention can be realized by investigating a local change of the segments similar to each segment, the number of segments to be investigated can be set to a fixed number. Therefore, the memory capacity necessary for the processing can be fixed, and the video-audio processing apparatus can be incorporated also in an apparatus for domestic use such as a set top box or a digital recorder which has a comparatively small memory capacity.

The second advantage is that, as described above in the first advantage, the process for detecting a scene is realized by successively processing a predetermined number of segments. This allows real-time processing wherein the time required for each segment is fixed. This is suitable for an apparatus for domestic use such as a set top box or a digital recorder wherein a predetermined process must be completed without fail within a predetermined time.

The third advantage is that, since the processing for scene detection successively processes a predetermined number of segments for each segment as described hereinabove, sequential processing wherein processing for a new segment is performed each time the processing for one segment is completed is possible. This makes it possible to end, when recording of a video signal or the like is ended with an apparatus for domestic use such as a set top box or a digital recorder, the processing substantially simultaneously with the ending time of the recording. Further, even if the recording is stopped for some reason, it is possible to keep the record made up to that point.
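The fixed memory requirement and the sequential, per-segment processing described in these advantages can be sketched with a bounded buffer. The following Python generator is only an illustration under assumed names (`stream_boundary_likelihood`, `feature_stream`, `dissimilarity`); it keeps at most 2k+1 feature amounts in memory and emits one boundary likelihood measurement value for each segment once its k future neighbours have arrived.

```python
from collections import deque


def stream_boundary_likelihood(feature_stream, k, n_similar, dissimilarity):
    """Yield F values sequentially while holding at most 2k+1 feature amounts.

    feature_stream : iterable producing one feature amount per new segment (assumed)
    k              : half-width of the search range (+/-k segments)
    n_similar      : N, number of most similar neighbours counted
    dissimilarity  : callable(a, b) -> float, smaller means more similar (assumed)
    """
    window = deque(maxlen=2 * k + 1)  # old segments fall out automatically
    for feat in feature_stream:
        window.append(feat)
        if len(window) < 2 * k + 1:
            continue  # not enough future segments yet for the centre segment
        centre = k
        ref = window[centre]
        nearest = sorted(
            (j for j in range(len(window)) if j != centre),
            key=lambda j: dissimilarity(ref, window[j]),
        )[:n_similar]
        # +1 for each similar segment in the future, -1 for each one in the past.
        yield sum(1 if j > centre else -1 for j in nearest)


if __name__ == "__main__":
    feats = [0.0] * 5 + [10.0] * 5
    out = list(stream_boundary_likelihood(iter(feats), 4, 4, lambda a, b: abs(a - b)))
    print(out)  # [-4, 4]
```

Because the buffer length is bounded by 2k+1, the work per incoming segment is also bounded, which is the property the second advantage relies on. Segments within k of either end of the stream are not evaluated in this sketch.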

While the series of processes described above can be executed by hardware, it may otherwise be executed by software. Where the series of processes is executed by software, a program which constructs the software is installed from a recording medium into a computer incorporated in hardware for exclusive use or, for example, a personal computer for universal use which can execute various functions by installing various programs.

The recording medium may be formed as a package medium such as, as shown in FIG. 3, a magnetic disk 22 (including a floppy disk), an optical disk 23 (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disk)), a magneto-optical disk 24 (including an MD (Mini-Disc)), or a semiconductor memory 25 which has the program recorded thereon or therein and is distributed in order to provide the program to a user separately from a computer, or as a ROM or a hard disk which has the program recorded therein or thereon and is provided to a user in a form wherein it is incorporated in a computer.

It is to be noted that, in the present specification, the steps which describe the program recorded in or on a recording medium may be but need not necessarily be processed in a time series in the order as described, and include processes which are executed parallelly or individually without being processed in a time series.

Further, in the present specification, the term “system” is used to represent an entire apparatus composed of a plurality of apparatus.

While a preferred embodiment of the invention has been described using specific terms, such description is for illustrative purposes only, and it is to be understood that changes and variations may be made without departing from the spirit or scope of the following claims.

1-12. (canceled)
13. An AV signal processing apparatus for detecting a boundary between scenes, comprising: feature amount extraction means for extracting feature amounts of segments each formed from a series of frames which form an AV signal; similarity measurement means for measuring a similarity between a segment and other segments in a predetermined time domain between a past time and a future time using said feature amounts; similar segments detection means for detecting similar segments according to said similarity for each segment in said predetermined time domain; similar segments counting means for counting said similar segments in said past and said future in said predetermined time domain for each segment; boundary likelihood measurement calculation means for calculating boundary likelihood measurement value according to a counted amount of said similar segments in said predetermined time domain for each segment; pattern detection means for detecting a pattern of existence of said boundary likelihood measurement values in said predetermined time domain; and boundary discrimination means for discriminating a boundary of a scene according to said pattern.
14. The AV signal processing apparatus according to claim 13, wherein said AV signal includes at least one of a video signal and an audio signal.
15. The AV signal processing apparatus according to claim 14, further comprising audio segment production means for detecting, when the AV signal includes a video signal, a shot which is a basic unit of a video segment to produce an audio segment.
16. The AV signal processing apparatus according to claim 14, further comprising audio segment production means for using, when the AV signal includes an audio signal, at least one of the feature amounts of the audio signal and a no sound period to produce an audio segment.
17. The AV signal processing apparatus according to claim 14, wherein the feature amounts of the video signal at least include a color histogram.
18. The AV signal processing apparatus according to claim 14, wherein the feature amounts of the video signal at least include at least one of a sound volume and a spectrum.
19. The AV signal processing apparatus according to claim 13, wherein said boundary discrimination means compares the measurement value with a preset threshold value to discriminate whether or not a reference segment is a boundary of the scene.
20. An AV signal processing apparatus for detecting a boundary between scenes, comprising: an extractor operable to extract a feature amount of segments each formed from a series of frames which form an AV signal; a similarity measurer operable to measure a similarity between a segment and other segments in a predetermined time domain between a past time and a future time using said feature amounts; a detector operable to detect similar segments according to said similarity for each segment in said predetermined time domain; a counter operable to count similar segments in said past and said future in said predetermined time domain for each segment; a boundary likelihood measurer operable to calculate a boundary likelihood measurement value according to a counted amount of said similar segments in said predetermined time domain for each segment; a detector operable to detect a pattern of existence of said boundary likelihood measurement values in said predetermined time domain; and a discriminator operable to discriminate a boundary of a scene according to said pattern.
21. The AV signal processing apparatus according to claim 20, wherein said AV signal includes at least one of a video signal and an audio signal.
22. The AV signal processing apparatus according to claim 21, further comprising a detector operable to detect, when the AV signal includes a video signal, a shot which is a basic unit of a video segment to produce an audio segment.
23. The AV signal processing apparatus according to claim 21, further comprising a detector operable to determine, when the AV signal includes an audio signal, at least one of the feature amounts of the audio signal and a no sound period to produce an audio segment.
24. The AV signal processing apparatus according to claim 21, wherein the feature amounts of the video signal at least include a color histogram.
25. The AV signal processing apparatus according to claim 21, wherein the feature amounts of the video signal at least include at least one of a sound volume and a spectrum.
26. The AV signal processing apparatus according to claim 20, wherein said discriminator compares the measurement value with a preset threshold value to discriminate whether or not a reference segment is a boundary of the scene.
27. A method of detecting a boundary between scenes in an AV signal comprising the steps of: extracting feature amounts of segments each formed from a series of frames which form an AV signal; measuring a similarity between a segment and other segments in a predetermined time domain between a past time and a future time using said feature amounts; detecting similar segments according to said similarity for each segment in said predetermined time domain; counting said similar segments in said past and said future in said predetermined time domain for each segment; calculating boundary likelihood measurement value according to a counted amount of said similar segments in said predetermined time domain for each segment; detecting a pattern of existence of said boundary likelihood measurement values in said predetermined time domain; and discriminating a boundary of a scene according to said pattern.
28. The method of claim 27, wherein said AV signal includes at least one of a video signal and an audio signal.
29. The method of claim 28, further comprising a step of: detecting, when the AV signal includes a video signal, a shot which is a basic unit of a video segment, and producing an audio segment.
30. The method of claim 28, further comprising a step of: producing an audio segment, when the AV signal includes an audio signal, by using at least one of the feature amounts of the audio signal and a no sound period.
31. The method of claim 28, wherein the feature amounts of the video signal include at least one color histogram.
32. The method of claim 28, wherein the feature amounts of the video signal include at least one of a sound volume and a spectrum.
33. The method of claim 27, wherein said step of discriminating comprises: comparing the measurement value with a preset threshold value; and discriminating whether or not a reference segment is a boundary of the scene.
34. A recording medium having recorded thereon a program for detecting a boundary between scenes in an AV signal said program describing steps of: extracting feature amounts of segments each formed from a series of frames which form an AV signal; measuring a similarity between a segment and other segments in a predetermined time domain between a past time and a future time using said feature amounts; detecting similar segments according to said similarity for each segment in said predetermined time domain; counting said similar segments in said past and said future in said predetermined time domain for each segment; calculating boundary likelihood measurement value according to a counted amount of said similar segments in said predetermined time domain for each segment; detecting a pattern of existence of said boundary likelihood measurement values in said predetermined time domain; and discriminating a boundary of a scene according to said pattern.
35. The recording medium of claim 34, wherein said AV signal includes at least one of a video signal and an audio signal.
36. The recording medium of claim 35, wherein said program further describes the steps of: detecting, when the AV signal includes a video signal, a shot which is a basic unit of a video segment, and producing an audio segment.
37. The recording medium of claim 35, wherein said program further describes the steps of: producing an audio segment, when the AV signal includes an audio signal, by using at least one of the feature amounts of the audio signal and a no sound period.
38. The recording medium of claim 35, wherein the feature amounts of the video signal include at least one color histogram.
39. The recording medium of claim 35, wherein the feature amounts of the video signal include at least one of a sound volume and a spectrum.
40. The recording medium of claim 34, wherein said step of discriminating comprises: comparing the measurement value with a preset threshold value; and discriminating whether or not a reference segment is a boundary of the scene.
41. A computer program embodied on a computer readable medium, for detecting a boundary between scenes in an AV signal said program describing steps of: extracting feature amounts of segments each formed from a series of frames which form an AV signal; measuring a similarity between a segment and other segments in a predetermined time domain between a past time and a future time using said feature amounts; detecting similar segments according to said similarity for each segment in said predetermined time domain; counting said similar segments in said past and said future in said predetermined time domain for each segment; calculating boundary likelihood measurement value according to a counted amount of said similar segments in said predetermined time domain for each segment; detecting a pattern of existence of said boundary likelihood measurement values in said predetermined time domain; and discriminating a boundary of a scene according to said pattern.
42. The computer program of claim 41, wherein said AV signal includes at least one of a video signal and an audio signal.
43. The computer program of claim 42, further describing the steps of: detecting, when the AV signal includes a video signal, a shot which is a basic unit of a video segment, and producing an audio segment.
44. The computer program of claim 42, further describing the steps of: producing an audio segment, when the AV signal includes an audio signal, by using at least one of the feature amounts of the audio signal and a no sound period.
45. The computer program of claim 42, wherein the feature amounts of the video signal include at least one color histogram.
46. The computer program of claim 42, wherein the feature amounts of the video signal include at least one of a sound volume and a spectrum.
47. The computer program of claim 41, wherein said step of discriminating comprises: comparing the measurement value with a preset threshold value; and discriminating whether or not a reference segment is a boundary of the scene.